Over the years, we've heard many complaints about, and suggestions for changing, our current architecture.
One of the more frequent complaints was that RHQ requires an agent to be running on the managed machine and that this agent can become quite heavy.
This document tries to address both what we've called agentless management and the modularity of the agent. Achieving true agentless monitoring while keeping the strengths of our inventory model and rich metadata requires nontrivial changes to how we handle and categorize the incoming data.
Today, we rely on a single (pluggable) source of management data (be it monitoring, configuration, whatever) and that is the RHQ agent.
It is understandable that this requirement is not always welcomed by the userbase, especially in cases where we don't have a plugin for the needed type of application. On the other hand, users quite possibly already have some kind of monitoring built for their app - some just use log parsing, some have scripts to collect data, some have stats available in JMX, and so on.
We try to cater for some of those use cases by providing the ability to "wrap" an RHQ plugin around those "means" - our script-server plugin for running external scripts, the JMX base plugin for JMX-based solutions. We also support deploying "jarless" plugins - so all a user needs can be as little as an XML file providing the necessary metadata for their management data.
Still, we see a lot of complaints. Some people cannot live with the fact that they need the full-blown RHQ agent running on their machine. Some people don't want to use half of the features we provide. For some people, the need to think about and write a custom XML file is just too much.
There are other problems that sometimes crop up, too. One of them is what we call dynamic resource types - a problem we already have a prototype solution for (although a bit dated and disliked by some). This ability (i.e. to create and change resource types programmatically, not by redeploying ever new versions of a plugin) comes in very handy when dealing, for example, with JMX or when trying to integrate with some other management solution.
This document will try to address, in some form, all of the problems outlined above in an architecture that, while representing quite a departure from our current model, would I think provide a foundation better suited to integration with and by other applications (which is the goal we set out to achieve lately).
All the text below assumes that the communication channel will solely be our REST API and that there is only a unidirectional communication from agents to the server.
Being able to either switch off or declare the capabilities of a certain "agent" (the quotes will be explained further down) is the cornerstone of all further thoughts.
In a nutshell, if we want to allow endpoints other than the RHQ agent to contribute to the collected management data, we need to know what such endpoints are capable of and we need a way of identifying them (this also touches on the security and authentication of such "agents").
The main idea is that each endpoint is considered a small "mini-agent" with a (limited) set of capabilities. These are not tied to platforms, though. As explained further down, being able to define a "platform" becomes one of the possible capabilities of an agent and is also a function available to the users.
Mini-agents contribute management data (monitoring or other) and it depends on their capabilities and a server-side "routing table" where such data end up in RHQ's inventory (see IncomingDataRouting).
Every endpoint (even if a simple python script) will have to be able to authenticate itself. Currently RHQ agents do that by the means of a security token.
The only change I am proposing here is to add the ability for the RHQ admin to generate a new token (or the agent registration) upfront, manually, so that an initial handshake between a new agent and the server isn't necessary. This is to support the case of a simple bash script pushing data to our REST interface using curl, with the token hardcoded in the script itself.
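To make this concrete, here is a minimal sketch of such a "mini-agent" push in Python. The endpoint path, the payload fields, and the use of a Bearer-style Authorization header are all assumptions for illustration - the real REST API and token format are not defined by this document:

```python
import json
import urllib.request

# Hypothetical server URL and pre-generated token (issued upfront by the
# RHQ admin, so no initial handshake is needed).
RHQ_SERVER = "http://localhost:7080/rest"
AGENT_TOKEN = "pre-generated-token"

def build_metric_push(metric_name, value):
    """Build an authenticated metric-push request (not sent here)."""
    payload = json.dumps({"metric": metric_name, "value": value}).encode("utf-8")
    return urllib.request.Request(
        url=RHQ_SERVER + "/metrics",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The hardcoded token identifies and authenticates this endpoint.
            "Authorization": "Bearer " + AGENT_TOKEN,
        },
    )

req = build_metric_push("free-memory", 123456)
# urllib.request.urlopen(req) would actually deliver the datum.
```

The same request is what the curl-in-a-bash-script scenario would produce; the point is that any language able to speak HTTP and JSON can act as an agent.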
Capabilities define, as the name would suggest, what the (mini-)agents are capable of doing.
They do not need to be expressed explicitly anywhere other than in the signatures of the methods of our REST API (i.e. an agent incapable of configuration management will never call a configuration-related method).
The list of capabilities roughly follows the current agent plugin facets but enriches it with a couple of more abstract concepts (this list is subject to change as we develop it):
discovery - the agent is able to discover resources. This does not mean it has to know their types, though.
resource-type-aware - this means that the agent is aware of the type of the data it is reporting (or rather RHQ's representation of it)
resource-type-defining - this agent can define a resource type. It is by default assumed that the resource types are DynamicResourceTypes.
resource-targetting - this means the agent is aware of individual resources and is able to manage them separately. If the agent is not resource-aware, we make no assumptions about what the reported data represent or whether they relate to a single "resource" on the server. I.e. if it reports 2 metrics, A and B, it is entirely in the hands of the routing table where in the server-side inventory those metrics will end up - even on 2 different resources.
platform-defining - not sure if this is even needed. If an agent is resource-type-aware, discovery-capable and resource-type-defining, and it happens to discover a resource of a platform type, we're done here.
schedulable - the agent is able to run "stuff" on a schedule rather than anytime it feels like it.
availability - the agent is capable of reporting availability. If it is not resource-targetting, then it basically reports the availability of "itself", whatever that may mean.
metrics - the agent is capable of collecting ordinary or "calltime" metrics.
configurable - the agent can be configured (i.e. can be supplied a pluginConfiguration for itself or its resources if it's resource-targetting, too).
configuration - the agent can configure the resources (i.e. can read and update the resourceConfiguration).
operations - the agent can invoke operations.
provisioning - the agent can provision stuff into "itself" or individual resources.
drift - the agent can do configuration drift management.
events - the agent can collect events.
upgrading - the agent can "lift" resources defined under an old resource type definition to a new one after an upgrade of the resource type occurred.
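The list above can be illustrated with two very different endpoints: a full agent declaring (explicitly or implicitly, via the API methods it calls) nearly everything, and a simple script declaring only metrics collection. The set representation and the `can` helper are assumptions made for illustration, not a proposed API:

```python
# Hypothetical capability sets for two very different "mini-agents".
# The capability names follow the list above; everything else is illustrative.
FULL_AGENT = {
    "discovery", "resource-type-aware", "resource-type-defining",
    "resource-targetting", "schedulable", "availability", "metrics",
    "configurable", "configuration", "operations", "provisioning",
    "drift", "events", "upgrading",
}

# A bash-script-like endpoint that only pushes a few numbers.
SIMPLE_SCRIPT = {"metrics"}

def can(agent_capabilities, capability):
    """A server-side check before accepting a given kind of data."""
    return capability in agent_capabilities
```

The server would use such a check to decide, for example, that the simple script can never be asked to invoke an operation.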
On the server side, we are still going to have our good old inventory with its platforms and resources. But because we are loosening the restrictions on what the "agents" are required to send up to the server, we also need a way of routing potentially ambiguous information from the agents to the correct target place in the inventory.
I envision the following workflows:
resource unaware agents:
The agent pops up in a discovery queue of sorts
The admin selects or creates a resource anywhere in the existing inventory that will receive further updates from the agent. "Creating a resource" is potentially a complicated process of manual definition of a new resource type, if none of the already existing resource types fits. This also potentially involves mapping metric names, etc.
resource-aware agents:
This works exactly the same way as today, with the exception that the newly discovered top-level resources might not be put under a platform, because the platform might not be known to the agent. Rather, we would restructure the information from "under this platform we discovered this and that" to "this agent discovered this and that and thinks it might belong here and there, is that OK?".
Note that a resource can be backed by data from several agents and vice versa - several resources could be (partially) backed by data from a single agent. This is a consequence of the dynamic formation of resources from incoming data as described above.
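A minimal sketch of what such a server-side routing table could look like, assuming the admin has already mapped a resource-unaware agent's metrics in the workflow above. The table shape, the key of (agent token, metric name), and the function name are all invented for illustration; the real design belongs to IncomingDataRouting:

```python
# Hypothetical routing table filled in by the admin: data from one
# resource-unaware agent can end up on two different inventory resources.
ROUTING_TABLE = {
    # (agent token, metric name) -> target resource id in the inventory
    ("script-agent-1", "A"): 1001,
    ("script-agent-1", "B"): 2002,  # same agent, different resource
}

def route_metric(agent_token, metric_name):
    """Return the inventory resource that should receive this datum,
    or None to park the agent in the discovery queue for admin mapping."""
    return ROUTING_TABLE.get((agent_token, metric_name))
```

This mirrors the metrics A and B example from the capabilities section: where a datum lands is entirely a server-side decision.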
We have already started to do this. IMHO, we just need to drop our current "proprietary" agent-server comms and design a proper public REST API that any agent-like endpoint can use to push data into RHQ.
The AgentCapabilitiesandIdentification section has already explained a large part of this. Basically, an agent becomes a very loose term that could probably be better called an "authenticated data source". It doesn't matter what it is, where it is or how much it can do. It just pushes data into RHQ and server-side mechanisms then decide where the information will end up inside RHQ.
The cornerstone of RHQ, today and in the future, is our inventory model. That is the way we represent the "real world" out there.
This model is based on the notion of resource types that describe a manageable "thing" (resource) - i.e. its configuration properties, metrics being collected, operations that can be invoked, etc.
This cannot go away. Our model is quite unique in the amount of metadata we can provide (and reason about).
What we need to change, though, is the rigidity with which we deal with resource types. We already have mechanisms to update resource types and, while I do not know the extent to which we support resource type changes today, we should push that ability to the limits - adding or removing metrics and changing configuration definitions should be non-destructive.
But that's not all. I already mentioned above that we should provide users with the ability to craft their own resource types (both in the UI and using the REST API). This also includes expanding on existing resource types (think of a user defining a new JMX-based resource type for their application - i.e. picking and choosing which MBeans they are interested in). This could be enhanced by, for example, a new server plugin type - resource type builders - that would assist the user with the creation of resource types by doing some sort of sniffing on an "example" target (i.e. the user provides a connection to an MBean server and the builder interactively offers them all the MBeans there, so that the user doesn't have to provide full object names).
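A user-crafted, JMX-flavored dynamic resource type could then be submitted to the REST API as a plain JSON document. All field names below are invented for illustration - the actual schema would have to be designed:

```python
import json

# Hypothetical JSON body a user (or a resource-type-defining agent)
# could POST to the REST API to create a dynamic resource type.
resource_type_def = {
    "name": "MyAppCache",
    "parentType": "JMX Server",
    "metrics": [
        {"name": "hitRatio", "units": "percentage"},
        {"name": "evictions", "units": "none"},
    ],
    "operations": ["clear"],
}

body = json.dumps(resource_type_def)
```

Note there is no plugin jar and no XML descriptor involved - the metadata lives entirely on the server side.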
In the past I also experimented with an annotation processor that would, at compile time, generate an RHQ plugin enabling seamless integration of any Java-based software with RHQ (btw. the architecture was such that most of the code would execute in-process with the managed software, so our recent problems around the privileges required for agents would be solved in this case). I put this to rest unfinished, though (https://github.com/metlos/rhq-plugin-annotation-processor).
Being able to write plugins in multiple languages, not just Java, has become a must-have feature in recent years.
While there has already been some thought about bringing other languages to the RHQ agent (which has, imho, a non-trivial cost both in terms of required memory and in speed of execution, even though Java 7 and 8 provide some significant performance boosts), I think it is more important to be able to push data into RHQ easily and without an RHQ agent proper. This in consequence enables writing your "plugin" in any language capable of producing JSON and making an HTTP connection.
If we must think about supporting other languages in the RHQ agent proper, then I think the first step would be the integration of our script server plugin (which calls executables to gather info) into our plugin API - i.e. updating our plugin descriptor and plugin container to enable this kind of functionality out of the box, without the need to go through an intermediary.
Only then would I think about bundling script engine implementations with our agent that would enable running other languages in the agent's JVM.